Abstract

Learning vector representations for words is an important research
field which may benefit many natural language processing tasks. Two
limitations exist in nearly all available models: the bias caused by
the context definition and the lack of knowledge utilization. They are
difficult to tackle because these algorithms are essentially
unsupervised learning approaches. Inspired by deep learning, the
authors propose a supervised framework for learning vector
representations of words that provides additional supervised fine
tuning after unsupervised learning. The framework is a knowledge-rich
approach and is compatible with any numerical vector word
representation. The authors perform both intrinsic evaluations, such as
attributional and relational similarity prediction, and extrinsic
evaluations, such as sentence completion and sentiment analysis.
Experimental results on 6 embeddings and 4 tasks with 10 datasets show
that the proposed fine tuning framework may significantly improve the
quality of the vector representation of words.


Learning a numerical vector to represent the semantic meaning of a word
is a research topic of wide interest in computational linguistics.
Various applications like information retrieval (Manning et al., 2008),
sentiment analysis (Maas et al., 2011) and semantic role labeling
(Collobert et al., 2011a) benefit from the vector representation of
words. There are two classes of methods for learning vector
representations of words: distributional semantic models (DSMs) and
neural language models (word embeddings). In DSMs, a large sparse
global co-occurrence matrix representing the context of words is first
constructed, and dimension reduction techniques such as SVD are then
applied to find the low dimensional representation of words (Clark,
2012; Turney et al., 2010). Neural language models learn the vector
representation of words using artificial neural networks. This method
was first studied in (Bengio et al., 2003), and many variants have been
proposed (Turian et al., 2010; Collobert et al., 2011b; Mikolov et al.,
2013a; Mikolov et al., 2013b; Pennington et al., 2014). In neural
language methods, the key issue is the formulation of the training
target function, minimization or maximization of which may produce
meaningful vector representations of words. Ideally, the training
target function should reflect the objective of word representation
learning, namely that the semantic similarity between words, as
represented by distance measures between word vectors, is consistent
with human cognition. The training target in the above works is to
maximise the context prediction ability, which is not directly related
to the objective of word representation learning. They are therefore
essentially unsupervised approaches working with a pseudo-supervised
trick.


In both DSMs and neural language models, a human defined context for
each word is needed. The selection of context may have a strong
influence on the word vectors obtained. However, there is no generally
best context definition, because emphasizing one perspective of context
results in the loss of other valuable information. The examples in
Table 1 show that the context length has an impact on the learned word
vectors. The three models are trained with the same settings except the
size of the context window.


Table 1: Top 5 nearest neighbours for “snake”. 


window size 5    window size 7    window size 9
tarantula        cobra            cobra
venomous         tarantula        lizard
frog             nonvenomous      tarantula
rattlesnake      spider           frog
lizard           python           porcupine


Many research works show that the marriage of unsupervised and
supervised learning may achieve better performance. For example, in
deep learning, unsupervised feature learning is employed to initialize
the neural network to enhance the performance. This inspires the
authors to introduce a supervised fine tuning framework to word
representation learning, with the goal of addressing the bias problem
resulting from the context definition.

However, introducing supervised learning into word representation
learning is not a trivial issue. The most challenging problem is how to
automatically generate labeled data. Manually labelling data is
impractical because it is difficult to determine the exact numerical
value for semantic similarity, and the size of the required training
data is very large. Although lexical semantic resources are available,
the knowledge graph is not directly applicable for supervised learning.
Another challenging problem is how to design a suitable learning
algorithm so that the fine tuning will not disrupt too much of the
context-based learning results.

The supervised fine tuning learning framework proposed in this paper
provides solutions to the above mentioned problems. The core idea
behind the proposed framework is to use the ranking of word
similarities instead of the exact similarity measure values as the
training target. First, an algorithm that automatically generates
labeled data based on existing word embeddings and knowledge resources
is proposed. Second, an inverse error weighted mini-batch stochastic
gradient descent optimization algorithm is designed, which can
effectively absorb the complementary knowledge to fine tune the word
embeddings without disrupting the original geometric similarity of
word embeddings.

The remainder of the paper is organized as follows. In Section 2,
related work on word representation learning is briefly reviewed. The
proposed supervised fine tuning framework is detailed in Section 3.
Evaluation and model analysis are given in Section 4. Section 5
concludes the paper.


1 Related Work 


Vector representation learning for words has received considerable
attention in the past two decades. The methods reported in the
literature can be classified into two categories: distributional
semantic models and neural language models.

Distributional semantic models are based on the principle that words
with similar semantic meanings have similar contextual information
(Harris, 1954; Firth, 1957). The statistical contextual information is
usually summarized into a large sparse matrix, each row of which is the
global contextual information of a word in the large corpus, where
weighting schemes like tf-idf and PMI are often used to remove the bias
of frequency. Dimension reduction techniques like singular value
decomposition are then applied to the high dimensional sparse matrix to
generate a low dimensional dense matrix with the same number of rows as
the original sparse matrix. Each row of the obtained low dimensional
matrix is the vector representation of a word.

The neural language models are based on artificial neural networks,
whose input is the context of a word and whose parameters are the
vector representations of the words. The parameters of the neural
network are adjusted based on the training target and approximation
algorithm. A variety of training targets have been proposed. For
example, (Bengio et al., 2003) employs a sequence of words to predict
the next word, (Turian et al., 2010) employs context to predict the
middle word, and (Mikolov et al., 2013a) predicts all context given the
central word. For the approximation algorithm, hierarchical softmax is
borrowed from (Morin and Bengio, 2005), and different methods to
construct the hierarchical tree are studied in (Mnih and Hinton, 2009b;
Mikolov et al., 2013b). Negative sampling approaches are proposed in
(Collobert et al., 2011b; Mnih and Teh, 2012; Mikolov et al., 2013b).


Besides the above data-based neural language models, there are methods
that incorporate knowledge into the learning of neural language models.
The major training target of these knowledge powered models is still
based on the contextual information, and the knowledge is used as an
auxiliary resource. For example, (Xu et al., 2014) propose a framework
to add relational and categorical knowledge as regularization of the
original training target, and (Qiu et al., 2014) utilize morphological
knowledge as both additional input representation and auxiliary
supervision. (Botha and Blunsom, 2014) trained a compositional
morphological vector representation, in which a word vector is the sum
of its morphological factor vectors. (Passos et al., 2014) proposed a
lexicon infused word representation learning algorithm for named entity
recognition which is built upon the existing skip-gram model. In spite
of the introduction of knowledge into the learning algorithm, these
methods still suffer from the bias problem caused by the context
definition.

(Bansal et al., 2014) employ an ensemble of different word
representations which outperforms all individual word representations
on the dependency parsing application. This suggests that
complementary information exists in different word representations.
However, employing an ensemble classifier may dramatically increase the
computational burden in prediction because it utilizes multiple
classifiers. The supervised fine tuning framework proposed in this
paper may be treated as a compressed ensemble model, which directly
compresses the complementary information into an individual word
representation. Instead of training an ensemble classifier on all the
word embeddings, we may use the complementary information by training
an individual classifier on the new word embedding.


2 Supervised Fine Tuning for Word 
Representations 

2.1 Overview 

To alleviate the bias problem mentioned above, this study proposes a
supervised fine tuning framework for word representation learning. The
overview of the framework is shown in Figure 1. The proposed supervised
fine tuning framework is built upon established word embeddings and
lexical semantic resources. There are two parts in the fine tuning
framework: ranking data generation and supervised ranking fine tuning.

Figure 1: Supervised Fine Tuning Framework.


In the ranking data generation phase, the ranking information for every
word is extracted from each individual word embedding and then
integrated by a score based multiple criteria fusion algorithm. Human
summarized knowledge like WordNet (Miller, 1995) may also be included
in the data generation phase. With the labeled training data obtained
above, the supervised ranking learning algorithm may fine tune the
original word vectors to encode the complementary knowledge from other
word embeddings.

2.2 Automatic Labeling of Training Data 

As mentioned in the Introduction, the training target is to learn the
semantic similarity ranking instead of the similarity measure itself.
The goal of training data labeling is to generate the ranking of
semantically similar words for each training word.

2.2.1 Why Ranking?

The biggest challenge for supervised word representation fine tuning is
the labelling of training data. First, it is difficult to define the
exact similarity values between two words. Second, it is impractical to
manually label a huge amount of training data to support supervised
word representation learning. In previous studies, each method starts
from scratch, without using the results of other embeddings. In
contrast, our work is built upon existing word embeddings. Thus,
labelling of training data can be conducted automatically using the
existing word embeddings.

The similarity measure is affected by many factors such as the
dimensionality of the word vectors, the employed learning algorithms
and the corpus size. An example from the GloVe embedding (Pennington et
al., 2014) illustrates the phenomenon. The cosine similarity between
the words “fish” and “salmon” is 0.6596 in the 300 dimension version of
the GloVe embedding and 0.8340 in the 50 dimension version. Although
the similarity values are quite different, the word “salmon” is the
most similar word to the word “fish” in both embeddings. This reveals
that the ranking of similarity values is more robust than the
similarity values themselves. Inspired by this finding, the authors
propose to employ ranking information as the supervised training
target.

2.2.2 Multiple Ranking Integration 

The similarity ranking information can be extracted from existing word
embeddings and used as the training target. The issue is which word
embedding we should use. Even state-of-the-art embeddings may not
always provide reliable ranking information. Table 2 shows some obvious
errors in the related words of the words “hill”, “run”, and “paper” in
the GloVe word embedding. The words in bold are the errors among the
real related words, and the number beside each word represents its
ranking position. Obviously, errors exist even for highly frequent
words like “paper” and “hill”. To address this problem, the authors
propose to enhance the ranking information labels using multiple word
embeddings. The employed word embeddings should be trained with
different algorithms, context definitions and corpora. The
complementary effect of different word embeddings will improve the
robustness and reliability of the ranking information.

To integrate multiple ranking information sources, a score-based
multiple criteria fusion algorithm is utilized. Generally, in a
score-based multiple criteria fusion algorithm, a commonly used score
is defined to measure the credit of each candidate data

Table 2: Errors in GloVe300 embedding. 


Hill            Run          Paper
now, 57         three, 12    instead, 30
known, 64       only, 17     put, 33
woods, 70       not, 30      document, 35
hillside, 72    come, 39     books, 42
called, 74      seven, 47    mirror, 53
wood, 97        take, 58     publish, 57


in every criterion. A combination algorithm is then utilized to
aggregate the multiple scores into one consensus score. Finally, a
ranking procedure is performed to rank the candidate data based on
their consensus scores. In this study, the candidate data are the
candidate words to be selected as labeled data, and the score to
measure the credit of the candidate words is a normalized cosine
similarity measure.


$$\mathrm{credit}^{E}_{\omega_r}(\omega_d) = \frac{\mathrm{cosine}(\omega_r, \omega_d)}{\max\{\mathrm{cosine}(\omega_r, \omega_i) \mid \omega_i \in V\}} \qquad (1)$$


The score of word $\omega_d$ in the ranking for word $\omega_r$ may be
obtained by Equation (1), where $V$ represents the whole vocabulary. In
Equation (2), the cumulated score for each candidate word in all
embeddings is divided by the number of times the word pair appears in
these embeddings.


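$$\mathrm{score}_{\omega_r}(\omega_d) = \frac{\sum_{E \in Embeds} \mathrm{credit}^{E}_{\omega_r}(\omega_d)}{\sum_{E \in Embeds} I\big((\omega_r, \omega_d) \in E\big)} \qquad (2)$$
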

Here $I$ represents the indicator function, whose value is 1 if the
input condition is satisfied and 0 otherwise. Data with a high averaged
score is believed to be trustworthy knowledge, while data with a low
score is likely to be bias or errors.
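
The fusion step in Equations (1) and (2) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `credits` and `fused_scores` and the toy vectors are hypothetical, each embedding is assumed to be a plain dictionary from word to vector, and the target word itself is excluded when taking the maximum in Equation (1) (otherwise the maximum would always be the self-similarity).

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def credits(embedding, target):
    """Equation (1): cosine to the target divided by the largest cosine.

    Assumption: the target word is left out of the maximum, and at least
    one other word has a positive similarity to the target.
    """
    sims = {w: cosine(embedding[target], vec)
            for w, vec in embedding.items() if w != target}
    if not sims:
        return {}
    best = max(sims.values())
    return {w: s / best for w, s in sims.items()}


def fused_scores(embeddings, target):
    """Equation (2): summed credit divided by the number of embeddings
    in which the word pair actually appears."""
    total = defaultdict(float)
    count = defaultdict(int)
    for emb in embeddings:
        if target not in emb:
            continue
        for word, credit in credits(emb, target).items():
            total[word] += credit
            count[word] += 1
    return {word: total[word] / count[word] for word in total}


# Hypothetical toy embeddings with partly different vocabularies.
emb_a = {"fish": (1.0, 0.0), "salmon": (0.9, 0.1), "car": (0.0, 1.0)}
emb_b = {"fish": (0.5, 0.5), "salmon": (0.6, 0.4), "boat": (0.1, 0.9)}

scores = fused_scores([emb_a, emb_b], "fish")
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Dividing by the per-pair count rather than by the number of embeddings keeps a word that is missing from one vocabulary (here "boat" or "car") from being penalized for its absence.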


2.2.3 Algorithm 

The whole procedure of the labeled data generation and integration is
given in Algorithm 1.

2.2.4 Reduction of Unnecessary Data 

It is impractical to learn the full ranking for each word because the
whole similarity matrix is extremely large. The authors find that it is
not necessary to employ the full ranking as training data for each
word. Intuitively, a word is unrelated to most of the

1 The authors’ implementation is based on sparse matrices and nested
mappings because the size of the matrix is very large.









Algorithm 1 Labeled Data Generation and Integration

Input: VocabList V stores the word indices in a list
    2-D count matrix D corresponding to the words in V
    2-D matrix C stores the number of times each word pair is selected
    2-D matrix S stores the number of times each word pair appears
    List of normalized embeddings: embeddings
    Number of words to extract: N
    data is a dictionary data structure that stores the output

Initialize D = 0, C = 0

1 Calculate the score of data
for all em in embeddings do
    for all w_v in V do
        (w_N, s_N) <- top N of {cos_em(w_v, w_i) | i in V}
        s_max <- max(s_N)
        for all w_i and s_i in (w_N, s_N) do
            D[w_v, w_i] += s_i / s_max
            C[w_v, w_i] += 1
        end for
    end for
end for

2 Remove low score data
for all w_v1 in V do
    for all w_v2 in V do
        if C[w_v1, w_v2] <= 2 then
            D[w_v1, w_v2] = 0
        end if
    end for
end for

3 Remove bias of frequency
D = D / S    (hint: point-wise division)
for all w_v in V do
    data[w_v] <- select nonzeros in D[w_v] and sort
end for

Output: data stores the ranking of related words



other words in the vocabulary, and the word embeddings should be
consistent with this fact: most of the similarity values should
indicate that two words are unrelated. Additionally, adjusting whether
“pen” or “water” is more similar to “basketball” does not make any
sense since both are totally unrelated to it. Based on the analysis
above, it is reasonable to care only about the meaningful word pairs
with strong relations between each other and to ignore other unrelated
words. Therefore, only the top ranked words in the ranking deserve
further fine tuning. Without the consideration of the unrelated words,
the data size of the supervised training may be greatly reduced to
hundreds per word. In this study, based on empirical experience, the
authors extract the top 200 most similar words as the candidate data
for each word from each individual word embedding.

2.3 Supervised Ranking Fine Tuning 
2.3.1 Training Target 

As mentioned in the Introduction, the training target in most neural
language models is to maximize the context prediction ability of word
embeddings. Different from the previous approaches, the proposed fine
tuning framework attempts to adjust the word vectors of a particular
word embedding so that the word similarity ranking is in line with the
labeled data. Fitting the ranking of the similarity measure may
directly alleviate the bias of context, therefore it is a supervised
solution to the bias problem caused by the context definition. In
addition, the ranking of semantic similarity is the most important
property of the desired word representation in most tasks. In this
sense, it is a supervised solution to the fine tuning of word
embeddings.

To achieve the above goal, a ranking loss function is employed as the
cost function. The ranking loss function $J_{rank}$ is shown in
Equation (3).

$$J_{rank} = \sum_{\omega_v \in V} \sum_{\omega_r \in D_{\omega_v}} \left| L_{\omega_r} - R_{\omega_r} \right| \qquad (3)$$


Here $V$ is the vocabulary and $D_{\omega_v}$ is the related data set
of word $\omega_v$; $L$ and $R$ denote the ranking in the labeled data
and in the word embedding under study, respectively.
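
As a small worked illustration of Equation (3), the sketch below sums the absolute differences between the labeled rank and the current rank over all labeled word pairs. It is a hedged sketch rather than the authors' code: the function names and the nested-dictionary layout (word, then related word, then rank position) are assumptions chosen for readability.

```python
def ranks_from_similarities(sims):
    """Turn a {word: similarity} mapping into {word: rank}, rank 1 = most similar."""
    ordered = sorted(sims, key=sims.get, reverse=True)
    return {word: position + 1 for position, word in enumerate(ordered)}


def ranking_loss(labeled, current):
    """Equation (3): sum of |L - R| over every labeled pair.

    labeled[w_v][w_r] is the desired rank of w_r for word w_v,
    current[w_v][w_r] is its rank in the embedding under study.
    """
    loss = 0
    for w_v, related in labeled.items():
        for w_r, desired in related.items():
            loss += abs(desired - current[w_v][w_r])
    return loss


# Hypothetical example: "tarantula" currently outranks the labeled words.
labeled = {"snake": {"cobra": 1, "python": 2, "lizard": 3}}
current = {"snake": {"tarantula": 1, "cobra": 2, "lizard": 3, "python": 5}}
print(ranking_loss(labeled, current))  # |1-2| + |2-5| + |3-3| = 4
```

Because the loss depends on integer rank positions, small changes to the vectors usually leave it unchanged, which is why it cannot be minimized by gradient descent directly.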

Because the ranking loss is not differentiable, the authors choose to
minimize the semantic similarity loss between the desired ranking
position and the real ranking position. Given the desired ranking
position, the similarity value corresponding to that position is
employed as the real training target. Minimizing the difference of
similarity values between the desired position and the real position
may also reduce the ranking loss. The similarity value loss function
$J_{simi}$ is given in Equation (4), where $S_{\omega_v}$ denotes the
sorted similarity values for word $\omega_v$.


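$$J_{simi} = \sum_{\omega_v \in V} \sum_{\omega_r \in D_{\omega_v}} \left| S_{\omega_v}(L_{\omega_r}) - S_{\omega_v}(R_{\omega_r}) \right|^2 \qquad (4)$$
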


2.3.2 Inverse Error Weighted Mini-batch SGD
Optimization

Stochastic gradient descent (SGD) is a widely used optimization
algorithm to minimize the loss function. The iterative update rule for
the parameters is shown in Equation (5).


$$\omega = \omega - \eta \times \frac{\partial J(d_i)}{\partial \omega} \qquad (5)$$



Here $\omega$ and $\eta$ denote the parameters and the learning rate,
and $\partial J(d_i) / \partial \omega$ is the approximation of the
gradient of the loss function $J$ based on data $d_i$. Because the
computation of the similarity distribution is very expensive,
mini-batch SGD is employed to speed up the learning process. In
mini-batch SGD, the gradient of the loss function is approximated by a
batch of data, which is shown in Equation (6).


$$g_{batch} = \frac{1}{N} \sum_{i \in batch} \frac{\partial J(d_i)}{\partial \omega} \qquad (6)$$

$$\omega = \omega - \eta \times g_{batch}$$


Here $N$ is the size of the batch and $g_{batch}$ denotes the batch
gradient. However, directly applying mini-batch SGD encounters a
serious problem: the training loss is reduced, but the performance on
all tasks deteriorates. This is quite similar to the overfitting
problem, in which the model learns the noise of the training data and
performs badly on unseen test data. However, there are two differences
between this problem and overfitting. Overfitting occurs in the last
phase of training and generally requires the model to have more
parameters than necessary, whereas this loss of generalization happens
at the start of the training and exists even when the number of
parameters is very small.

After exploring the learning process, the authors find that the reason
behind this problem is that the external knowledge of ranking
information may disrupt the geometric similarity of the original word
embedding under the standard mini-batch SGD optimization algorithm. The
existing word embeddings are still far from perfect, and some words
that are strongly related in reality may be ignored, which may be
caused by many factors such as polysemy, corpus bias and imbalanced
word frequency. Such words may have very large ranking errors, and the
gradients of these words then become very large because the magnitude
of the gradient is proportional to the error. Finally, the gradients of
these words dominate the overall gradient and disrupt the originally
learned geometric similarity.

In this fine tuning framework, the principle that data with larger
error deserves more learning is not applicable. The authors therefore
propose an inverse error weighted mini-batch SGD optimization algorithm
to remove the effect of the large errors.


$$g_{batch} = \frac{1}{N} \sum_{i \in batch} \frac{1}{e(d_i)} \frac{\partial J(d_i)}{\partial \omega} \qquad (7)$$



Equation (7) shows the inverse error weighted gradient, where $e(d_i)$
denotes the error of data $d_i$.
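
The difference between Equations (6) and (7) is easiest to see numerically. The sketch below is an illustration under assumptions, not the authors' implementation: the per-sample gradients are hypothetical hand-picked vectors whose magnitude equals their error, and the small `eps` guard against a zero error is an addition of this sketch.

```python
def standard_batch_gradient(grads):
    """Equation (6): plain average of the per-sample gradients."""
    n = len(grads)
    return [sum(column) / n for column in zip(*grads)]


def inverse_error_batch_gradient(grads, errors, eps=1e-12):
    """Equation (7): each gradient is divided by its own error first.

    eps is a guard added in this sketch to avoid division by zero.
    """
    n = len(grads)
    weighted = [[g_k / max(e, eps) for g_k in g] for g, e in zip(grads, errors)]
    return [sum(column) / n for column in zip(*weighted)]


# Two well-behaved samples and one sample with a very large error.
grads = [[0.1, 0.0], [0.0, 0.2], [-50.0, 0.0]]
errors = [0.1, 0.2, 50.0]

print(standard_batch_gradient(grads))
print(inverse_error_batch_gradient(grads, errors))
```

In the plain average the third sample decides the direction of the update almost alone, while after inverse error weighting every sample contributes a vector of comparable length, so no single large-error pair can dominate the batch.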

Besides the update from the labeled data, the algorithm also deals with
the data that are in the meaningful range but not covered by the
labeled data. These data are close to the trained word in the word
embedding, but actually not related to the word in reality. These data
are also processed by the proposed inverse error weighted mini-batch
SGD, but the errors are not included in the cost function.

To determine the range of meaningful semantic similarity in the
original word embeddings, a threshold based on the random similarity
distribution is utilized. The idea behind this is that similarity
values which are difficult to generate randomly represent the real
knowledge the word embedding has learned. To calculate the threshold
for a specific word embedding, a random matrix following a uniform
distribution and having the same shape as the word embedding is
generated, and the mean of all the similarity values in the top 5
ranking for all words is calculated as the random threshold. To speed
up the generation process, a sample may be used to approximate the
whole distribution. Similarity values above the random threshold have a
large probability of being meaningful data.
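
The random threshold described above can be sketched as follows. This is a minimal illustration with assumptions stated explicitly: the text does not give the range of the uniform distribution, so entries are drawn from [-1, 1] here, and the function name, the seed and the small vocabulary size are hypothetical choices made to keep the example fast.

```python
import math
import random


def random_threshold(n_words, dim, top_k=5, seed=0):
    """Mean of the top_k cosine similarities per word in a random matrix.

    Assumption: entries are uniform in [-1, 1]; the source only says the
    random matrix follows a uniform distribution.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_words):
        vec = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in vec))
        rows.append([x / norm for x in vec])  # unit length, so dot = cosine

    top_values = []
    for i, u in enumerate(rows):
        sims = [sum(a * b for a, b in zip(u, v))
                for j, v in enumerate(rows) if j != i]
        top_values.extend(sorted(sims, reverse=True)[:top_k])
    return sum(top_values) / len(top_values)


# A sample of 60 random "words" stands in for the full vocabulary.
low_dim = random_threshold(n_words=60, dim=3)
high_dim = random_threshold(n_words=60, dim=50)
print(low_dim, high_dim)
```

Random vectors in few dimensions easily end up with high cosine similarity, so the threshold has to be computed per embedding shape, which matches the text's requirement that the random matrix has the same shape as the word embedding.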

The details of the updating rules are provided in Algorithm 2.



Algorithm 2 Ranking Learning Algorithm

Input: vocabList V stores the word indices in a list and em is the
    initial embedding
    data is a two tier nested mapping that stores all training data
    data_wv and rank_wv are mappings between words and their ranking
    for w_v in the supervised labels and the trained embedding,
    respectively
    ul is a list that stores all the local updates for word w_v
    update is the global update vector for word w_v
    δ is the random threshold and α is the learning rate

Initialize error = len(V) × d, stop = False
while stop != True do
    for all w_v in V do
        data_wv <- data[w_v]
        rank_wv <- get ranking of all words for w_v
        ul <- ∅
        for all w_d in data_wv do
            sign <- I(rank_wv[w_d] - data_wv[w_d])
            ul.add(sign × cos(w_v, w_d) × em[w_d])
        end for
        ne_wv <- select from rank_wv if cosine > δ and not in data_wv
        for all w_d in ne_wv do
            ul.add(-1 × cos(w_v, w_d) × em[w_d])
        end for
        update <- mean of vectors in ul
        update <- normalization(update)
        em[w_v] = em[w_v] + α × update
        em[w_v] = normalization(em[w_v])
    end for
end while
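
The inner loop of Algorithm 2 can be sketched for a single word as below. This is a reading of the pseudocode under assumptions rather than the authors' implementation: `I(...)` is interpreted as the sign of the rank difference (positive when the word currently ranks lower than desired), embeddings are assumed to be unit vectors so that the dot product equals the cosine, and the names `update_word`, `delta` and `alpha` are hypothetical stand-ins for the threshold and learning rate.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def update_word(em, w_v, desired_rank, delta, alpha):
    """One pass of the inner loop of Algorithm 2 for word w_v (in place)."""
    sims = {w: dot(em[w_v], vec) for w, vec in em.items() if w != w_v}
    order = sorted(sims, key=sims.get, reverse=True)
    rank = {w: i + 1 for i, w in enumerate(order)}

    local = []
    # Labeled words: pull closer if ranked too low, push away if too high.
    for w_d, target in desired_rank.items():
        diff = rank[w_d] - target
        sign = (diff > 0) - (diff < 0)  # assumed meaning of I(.)
        local.append([sign * sims[w_d] * x for x in em[w_d]])
    # Unlabeled words above the random threshold are pushed away.
    for w_d in order:
        if sims[w_d] > delta and w_d not in desired_rank:
            local.append([-sims[w_d] * x for x in em[w_d]])
    if not local:
        return

    update = normalize([sum(col) / len(local) for col in zip(*local)])
    em[w_v] = normalize([a + alpha * b for a, b in zip(em[w_v], update)])


# Hypothetical unit vectors: "spider" starts closer to "snake" than "cobra".
em = {"snake": [1.0, 0.0], "cobra": [0.6, 0.8], "spider": [0.8, -0.6]}
update_word(em, "snake", {"cobra": 1}, delta=0.7, alpha=0.5)
print(dot(em["snake"], em["cobra"]), dot(em["snake"], em["spider"]))
```

Each local update has a length set by the cosine similarity and not by the ranking error, so in this reading a badly misranked word moves the vector no more than a slightly misranked one.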


2.4 Stopping Criterion and Overfitting 

Overfitting is a frequently encountered problem in machine learning.
Generally, overfitting is related to the number of parameters. If the
number of parameters is larger than the number of potential patterns in
the observations, the model may suffer from the overfitting problem.
The overfitting problem is more serious in this study because the
automatically generated labeled data may contain some noise and
conflicts. The widely employed stopping criterion is given in
Equation (8).


$$\frac{J_{rank}(i) - J_{rank}(i+1)}{J_{rank}(i)} \le \epsilon \qquad (8)$$



Here $J_{rank}(i)$ is the overall ranking loss in the $i$-th epoch,
which is the same as $J_{rank}$ in Equation (3), and $\epsilon$ is a
very small value like 0.01. If the relative improvement is smaller than
the predefined value $\epsilon$, the training should stop. However, the
authors find that this criterion may not generalize well for the
involved embeddings and tasks, even if $\epsilon$ is changed according
to the number of parameters. The authors employ an alternative stopping
criterion, which is related to the initial performance of the word
embedding. Equation (9) shows the criterion; the initial loss in its
denominator (the loss at epoch 0) is related to the performance of the
embeddings on various tasks. With this criterion, an $\epsilon$
selected according to the dimensionality of the word embeddings and
the initial performance may work better. This may be explained by the
fact that the quality of different embeddings is quite different:
poorly performing models may have more room to improve, and models with
a good initial status are more likely to overfit.


$$\frac{J_{rank}(i) - J_{rank}(i+1)}{J_{rank}(0)} \le \epsilon \qquad (9)$$

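
The two stopping rules differ only in their denominator, which a short sketch makes concrete. The function below is an illustrative helper with a hypothetical name and signature; it assumes the losses of all epochs so far are kept in a list whose first entry is the loss at epoch 0.

```python
def should_stop(losses, epsilon, relative_to_initial=True):
    """Check Equation (9) (default) or Equation (8) on a loss history.

    losses[0] is the loss at epoch 0 and losses[-1] the most recent one.
    """
    if len(losses) < 2:
        return False
    improvement = losses[-2] - losses[-1]
    denominator = losses[0] if relative_to_initial else losses[-2]
    if denominator == 0:
        return True  # nothing left to improve
    return improvement / denominator <= epsilon


history = [100.0, 60.0, 57.0]
print(should_stop(history, epsilon=0.04))                             # 3/100
print(should_stop(history, epsilon=0.04, relative_to_initial=False))  # 3/60
```

With the same loss history and the same epsilon, the criterion relative to the initial loss stops earlier than the epoch-to-epoch criterion once the loss has already dropped substantially.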


3 Experiment 

To evaluate the proposed supervised fine tuning framework, three groups
of experiments are designed.

The intrinsic and extrinsic evaluation experiments are conducted to
test the effect of fine tuning. Six word embeddings are further trained
by the proposed supervised fine tuning framework and evaluated on 4
tasks. The baselines are the original word embeddings, which include
many well-known word embeddings such as SENNA, Word2vec and GloVe.

The effect of the inverse error weighted mini-batch SGD optimization
algorithm is studied in the second group of experiments. Two comparison
groups are involved: one uses the widely employed standard optimization
algorithm and the other utilizes the proposed inverse error weighted
variant. The comparison uses 6 embeddings and 3 tasks with 3 classical
datasets. Except for the updating rules, all the other settings are the
same for the two comparison groups.







The influence of data size on the model is studied in the last group of
experiments with the two best embeddings and 8 datasets, and the
hypothesis about the reduced data size is tested.

3.1 Evaluation Methods 

3.1.1 Semantic Similarity Prediction 

Measuring the consistency between machine predicted similarity values
and human annotated similarity values is the most widely employed task
to evaluate the quality of word embeddings. The datasets are usually
constructed from crowdsourced cognitive experiments.

Many datasets are available, and the authors select 5 of them to cover
different perspectives and data sizes. Wordsim353 (Finkelstein et al.,
2001) and RG65 (Rubenstein and Goodenough, 1965) are the two most used
datasets for semantic similarity evaluation. MEN3000 (Bruni et al.,
2014) and Mturk771 (Halawi et al., 2012) are two recently generated
large datasets. The YP130 dataset (Yang and Powers, 2006) is designed
to evaluate semantic similarity between verbs. As in other works,
Spearman correlation is used to measure the consistency between human
annotation and machine prediction.

3.1.2 Analogical Reasoning 

The analogical reasoning task is designed to test the ability of models
on relational similarity tasks. The Google analogical reasoning dataset
is introduced by (Mikolov et al., 2013a). Each question has the
following format: Man : Woman = King : ?, and the answer should be the
exact word “Queen”. The Microsoft analogical reasoning dataset (Mikolov
et al., 2013c) is also utilized in this study. The major difference
between these two datasets is the types of relations they focus on.

As in other works, the nearest neighbour of the vector (Woman + King -
Man), excluding the three words in the question, is selected as the
candidate answer. If the candidate word is the same as the answer, the
question is correctly solved. In this study, the overall accuracy is
employed to evaluate the word embedding. Since the vocabulary of the
learned word embeddings is not as large as the one used in the original
study (Mikolov et al., 2013a), questions with unknown words are not
included in the experiment.
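
The answer selection rule for analogy questions can be sketched as follows. This is an illustrative sketch, not the evaluation script used in the study: the function name `solve_analogy` and the three-dimensional toy vectors are hypothetical, and the embedding is assumed to be a dictionary from word to vector.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_analogy(em, a, b, c):
    """Answer "a : b = c : ?" with the nearest neighbour of b - a + c,
    excluding the three question words."""
    query = [y - x + z for x, y, z in zip(em[a], em[b], em[c])]
    best_word, best_sim = None, -math.inf
    for word, vec in em.items():
        if word in (a, b, c):
            continue
        sim = cosine(query, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word


# Hypothetical toy space with axes (royalty, male, female).
em = {
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "prince": [1.0, 1.0, 0.1],
}
print(solve_analogy(em, "man", "woman", "king"))
```

Excluding the question words matters because the query vector usually stays close to one of them, which would otherwise be returned as a trivial answer.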


3.1.3 Sentence Completion 

The sentence completion challenge (Zweig and Burges, 2012) is intended
to stimulate research in the area of semantic modeling. The sentence
completion questions require selecting words which are meaningful and
coherent in the context of a complete sentence. In each sentence, an
infrequent word is chosen as the focus of the question, and four
alternate candidates are chosen as distractors from a list of words
suggested by an N-gram language model. Only the original word is
considered the correct answer to the question.

In this study, the authors select the candidate answer based on the
average similarity between each candidate and all the words in the
sentence. The word with the largest average value is selected as the
answer. This is the widely employed approach when this dataset is
utilized to evaluate word embeddings.
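
The selection rule described above amounts to a few lines of code. The sketch below is an illustration with hypothetical names and toy vectors; it assumes an embedding stored as a dictionary and simply skips sentence words that are missing from it, which is a choice of this sketch and is not stated in the text.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def complete_sentence(em, sentence_words, candidates):
    """Pick the candidate with the largest average cosine similarity
    to the known words of the sentence."""
    known = [w for w in sentence_words if w in em]  # skip out-of-vocabulary words

    def average_similarity(candidate):
        sims = [cosine(em[candidate], em[w]) for w in known]
        return sum(sims) / len(sims)

    return max(candidates, key=average_similarity)


# Hypothetical two-dimensional vectors for a tiny vocabulary.
em = {
    "river": [1.0, 0.1],
    "water": [0.9, 0.2],
    "boat": [0.8, 0.3],
    "desk": [0.1, 1.0],
}
print(complete_sentence(em, ["the", "river", "water"], ["boat", "desk"]))
```

The rule uses no word order or syntax, so the result depends only on how well the embedding captures topical similarity, which is what makes the task usable as an evaluation of the embedding itself.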

3.1.4 Sentiment Analysis 

Two sentence level sentiment classification datasets are used in this
task for the extrinsic evaluation. The customer reviews dataset (Hu and
Liu, 2004) contains reviews of 5 digital products from the Amazon web
site, and the movie review dataset (Pang and Lee, 2005) includes 5331
positive and 5331 negative processed sentences from the Rotten Tomatoes
movie review web site. In both datasets, each review sentence should be
classified as expressing a positive or negative attitude.

In this study, the authors employ the convolutional neural network
(CNN) described in (Kim, 2014) for the sentence level sentiment
classification.

3.2 Setting and Details 

Six word embeddings are studied in our experiments: HLBL50, which is
proposed in (Mnih and Hinton, 2009a), SENNA50 (Collobert et al.,
2011b), RNNLM640 (Mikolov et al., 2011), GloVe300 (Pennington et al.,
2014), DocAndDep2000 (Fyshe et al., 2013) and Word2vec (Mikolov et al.,
2013b). All embeddings are available on their websites (see footnote 2)
except the Word2vec


2 Senna: http://ml.nec-labs.com/senna/
RNNLM: http://www.fit.vutbr.cz/~imikolov/rnnlm/

















































embeddings. 

The authors train the Word2vec model with the publicly available
Word2vec toolkit using the data from the 1 billion word language
modeling benchmark dataset. The hierarchical softmax approximation
algorithm and the skip-gram structure are employed, where the window
size is set to 7 and the minimum count of words is set to 10. The
DocAndDep embeddings are constructed from two models, a document model
and a dependency model, and the authors utilize them separately in this
study. The document model is not further trained because its initial
performance has a large gap with the other embeddings.

For human summarized knowledge resources,
WordNet is employed in this experiment. Directly
connected words and words in the same synsets are
extracted as external knowledge. In the integration
process, all relations from WordNet are weighted
equally with a weight of 1.
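
As a rough illustration of this extraction step, the sketch
below collects word pairs from a toy, in-memory stand-in for
WordNet and gives every pair a weight of 1. The data
structures (a dictionary of synsets and a list of relation
edges), the synset identifiers and the function name
extract_knowledge_pairs are assumptions made for this
example; they are not the authors' code and not the API of
any WordNet library.

```python
from itertools import combinations


def extract_knowledge_pairs(synsets, relations):
    """Collect word pairs that share a synset or sit in directly related synsets.

    Every extracted pair receives the same weight of 1.0, mirroring the equal
    weighting of all relations described in the text.
    """
    weights = {}

    def add_pair(w1, w2):
        if w1 != w2:
            # Store each unordered pair once, in sorted order.
            weights[tuple(sorted((w1, w2)))] = 1.0

    # Words inside the same synset.
    for words in synsets.values():
        for w1, w2 in combinations(sorted(words), 2):
            add_pair(w1, w2)

    # Words in two synsets joined by a direct relation edge.
    for left_id, right_id in relations:
        for w1 in synsets[left_id]:
            for w2 in synsets[right_id]:
                add_pair(w1, w2)
    return weights


# Toy data, invented for illustration only.
toy_synsets = {
    "car.n.01": {"car", "automobile"},
    "vehicle.n.01": {"vehicle"},
    "dog.n.01": {"dog"},
}
toy_relations = [("car.n.01", "vehicle.n.01")]  # e.g. a hypernym link
pairs = extract_knowledge_pairs(toy_synsets, toy_relations)
```

Here "car" and "automobile" are paired because they share a
synset, and both are paired with "vehicle" through the
relation edge, while "dog" contributes no pair.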

The learning rate is set to 0.1 in most
experiments. This value is chosen for fast
convergence, because computing the cosine
similarity matrix in each epoch is very expensive.
To avoid overfitting, the stopping criterion is set
according to each embedding's dimensionality and
original performance. The values of ε for the
different embeddings are given in Table 3.


Table 3: Stopping Criterion Setting.

Name    Senna50  Skip50  HLBL50  GloVe300  RNNLM640  Dep1000
Value   0.04     0.05    0.004   0.07      0.07      0.10


The hyper-parameter setting for the convolutional
neural network used in sentiment analysis follows
the default setting in (Kim, 2014).


3.3 Result and Analysis 

3.3.1 Intrinsic and Extrinsic Evaluation 

The performance of the proposed supervised fine
tuning framework is detailed in Table 4. All word
embeddings are significantly enhanced after the
supervised fine tuning. After fine tuning, the
performance of the HLBL50 embedding is similar to
that of the SENNA50 and Skip50 embeddings, although
the initial performance of HLBL50 is not
satisfactory. The performance of GloVe300, the best
word embedding in this experiment, is also
significantly improved on all datasets. These
remarkable improvements suggest that the supervised
fine tuning framework may transfer complementary
knowledge from the weak embeddings and lexical
semantic resources into the strong embeddings.

HLBL: http://metaoptimize.com/projects/wordreprs/
GloVe: http://nlp.stanford.edu/projects/glove/
DocAndDep: http://www.cs.cmu.edu/~afyshe/papers/conll2013/

Table 5: Performance of GloVe on Sentiment Analysis.

Classifier   static               non-static
Task         Original  Trained    Original  Trained
CR           0.786     0.797      0.774     0.791
MR           0.749     0.761      0.753     0.757

From the perspective of the evaluation tasks, the
results support the conclusion that the supervised
fine tuning framework is effective for all the
tasks involved. Because the training labels are
strongly related to the similarity prediction
tasks, the performance on all 5 similarity datasets
is significantly improved. The improvement on the
analogical reasoning tasks is not as large as on
the similarity prediction tasks, but is still
remarkable. An interesting finding is that the
results on the sentence completion task form two
groups: the high dimensional word embeddings obtain
larger improvements than the low dimensional ones.
The authors believe that this is caused by both the
difference in learning capability and overfitting.

Table 5 shows the results for the sentiment analysis
task. Whether the word vectors are used directly as
features (static) or as an initialization of the
feature parameters (non-static), the fine tuned
GloVe word embedding outperforms the original one
on both datasets. This may indicate that the fine
tuned word embeddings have better separation
capability than the original embeddings for
sentiment analysis tasks.

3.3.2 Comparison of Standard SGD and 
Inverse Error Weighted SGD 

In the second experiment, the standard mini-batch
SGD and the inverse error weighted mini-batch SGD
are compared. The learning curves of both
approaches for all embeddings on three tasks are
shown in Figures 2, 3 and 4, respectively.

Table 4: Performance of 6 word embeddings on 8 datasets.

Embedding    Senna50         Skip50          HLBL50          GloVe300        RNNLM640        Dep1000
Dataset      before  after   before  after   before  after   before  after   before  after   before  after
WordSim353   0.40    0.56    0.53    0.55    0.23    0.53    0.55    0.62    0.30    0.55    0.44    0.59
Mturk771     0.48    0.59    0.52    0.60    0.280   0.59    0.63    0.69    0.40    0.65    0.61    0.70
RG65         0.49    0.61    0.56    0.60    0.36    0.66    0.74    0.77    0.51    0.65    0.68    0.76
YP130        0.16    0.42    0.35    0.45    0.23    0.49    0.55    0.59    0.40    0.57    0.56    0.65
M3k          0.59    0.68    0.65    0.73    0.28    0.62    0.73    0.77    0.46    0.65    0.63    0.73
Google_AR    0.128   0.295   0.202   0.360   0.098   0.275   0.479   0.508   0.324   0.382   0.206   0.255
MS_AR        0.185   0.411   0.286   0.451   0.2     0.441   0.665   0.685   0.507   0.538   0.393   0.429
MS_SC        0.329   0.289   0.30    0.30    0.26    0.254   0.335   0.375   0.259   0.291   0.359   0.376

Figure 2: Comparison of standard SGD and inverse error
weighted SGD on the M3K similarity prediction dataset.

[Plot; x-axis: epoch]

Figure 3: Comparison of standard SGD and inverse error
weighted SGD on the Google analogical reasoning dataset.

Figure 4: Comparison of standard SGD and inverse error
weighted SGD on the sentence completion dataset.

Nearly all the solid lines, which represent
standard mini-batch SGD in these three figures,
show a decreasing trend, which means that standard
mini-batch SGD does not help but instead degrades
the performance. In contrast, the performance of
the inverse error weighted approach, represented by
dashed lines, shows a general increasing trend. This
experiment demonstrates the phenomenon described
above: data with large error is too difficult to
learn and may disrupt the original geometric
similarity. Given the large performance gap in
this comparison, it may be concluded that the
inverse error weighted mini-batch SGD optimization
algorithm is suitable for supervised fine tuning of
word representations.
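
The intuition behind the comparison can be illustrated with
a toy sketch. It is not the update rule defined earlier in
the paper: the weighting form 1 / (|error| + delta), the
smoothing constant delta, the scalar model y = w * x and all
function names are assumptions made only for this example.
It shows how down-weighting large-error examples keeps a
single hard example from dominating a mini-batch update.

```python
def inverse_error_weights(errors, delta=1e-3):
    """Weights proportional to 1 / (|error| + delta), normalised to sum to 1."""
    raw = [1.0 / (abs(err) + delta) for err in errors]
    total = sum(raw)
    return [r / total for r in raw]


def standard_minibatch_step(w, batch, lr=0.1):
    """One plain mini-batch SGD step for y = w * x with squared error."""
    errors = [w * x - y for x, y in batch]
    grad = sum(2.0 * err * x for err, (x, _) in zip(errors, batch)) / len(batch)
    return w - lr * grad


def weighted_minibatch_step(w, batch, lr=0.1, delta=1e-3):
    """Same step, but each example's gradient is scaled by its inverse-error weight."""
    errors = [w * x - y for x, y in batch]
    weights = inverse_error_weights(errors, delta)
    grad = sum(
        wt * 2.0 * err * x for wt, err, (x, _) in zip(weights, errors, batch)
    )
    return w - lr * grad


# Two well-fit examples and one large-error example (toy data).
toy_batch = [(1.0, 2.0), (1.0, 2.1), (1.0, 10.0)]
w_standard = standard_minibatch_step(1.5, toy_batch)
w_weighted = weighted_minibatch_step(1.5, toy_batch)
```

In this toy batch the standard step is pulled strongly
toward the large-error example, whereas the weighted step
moves the parameter only a short distance in the same
direction, because the large-error example receives the
smallest weight.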

3.3.3 Effect of Training Data Size 

Table 6 shows the effect of the training data size.
The performance on the analogical reasoning tasks
improves slightly as the training data size
increases. However, the performance on the
similarity prediction tasks follows two different
patterns: small datasets such as RG65 and YP130
prefer a small amount of training data, whereas the
performance on large datasets such as M3K and
Mturk771 is not affected by the size of the
training data.

Table 6: Effect of Data Size.

Embedding    GloVe300                               Skip50
Data Size    0      50     100    150    200        0      50     100    150    200
WordSim353   0.55   [?]    [?]    0.61   [?]        0.53   [?]    0.51   0.51   [?]
Mturk771     0.63   [?]    0.69   0.70   0.69       0.52   [?]    0.62   0.60   [?]
RG65         0.74   0.80   0.77   0.70   0.77       0.56   [?]    0.64   0.63   0.60
YP130        0.55   0.62   0.63   0.59   0.59       0.35   0.56   0.52   0.49   0.45
M3k          0.73   0.76   0.75   0.78   0.77       0.65   0.73   0.73   0.73   0.73
Google_AR    0.479  0.472  0.479  0.494  0.508      0.202  0.295  0.33   0.358  0.360
MS_AR        0.665  0.67   0.69   0.68   0.685      0.286  0.395  [?]    0.446  0.451
MS_SC        0.335  0.42   0.39   0.39   0.375      0.30   0.31   [?]    0.30   0.30

In general, increasing the training data size is
only slightly helpful for the analogical reasoning
tasks, which empirically supports the hypothesis
that employing only the top related data is enough
for this supervised fine tuning.

4 Conclusion 

A supervised fine tuning framework is proposed to
boost unsupervised word representation learning
algorithms. The proposed automatic label generation
approach and the inverse error weighted mini-batch
SGD optimization algorithm make it possible to add
a supervised fine tuning phase to any unsupervised
word representation learning algorithm. Experiments
involving 10 datasets and 6 well trained embeddings
empirically demonstrate that the framework is very
effective at improving the quality of word
representations.